This is an exploratory data analysis of quality of red and white wines by their physicochemical properties such as alcohol level, density and pH. The goal is to find which physicochemical properties correlate to wine quality. We first explore the distribution of individual variables, then their relationships and finally summarise the findings.
This dataset lists 1599 and 4898 instances of red and white wine respectively, each containing 11 variables on the chemical properties of the wine and a quality score based on sensory data - median of at least 3 evaluations made by wine experts.
Of the 11 variables in the dataset, only alcohol showed a moderate correlation to quality in both red and white wines. For red wines, acetic acid concentration (volatile acidity) also showed a weak negative correlation to quality.
The wine industry is lucrative and growing, as wine becomes more popular and information about it more widely accessible. Many factors may affect the perceived taste and quality of wine. Among them, physicochemical properties of the wine, such as alcohol and sugar levels, pH and chlorides, may play an important role.
Here, we’ll analyse a dataset related to red and white variants of the Portuguese “Vinho Verde” wine, exploring which (if any) chemical properties influence their quality.
This dataset lists 1599 and 4898 instances of red and white wine respectively, each containing 11 variables on the chemical properties of the wine and a quality score based on sensory data: the median of at least 3 evaluations made by wine experts, each rating the wine between 0 (very bad) and 10 (excellent). Below, we describe each of the attributes registered in each instance:
1 - fixed acidity (tartaric acid - g / dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity (acetic acid - g / dm^3): the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid (g / dm^3): found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar (g / dm^3): the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides (sodium chloride - g / dm^3): the amount of salt in the wine
6 - free sulfur dioxide (mg / dm^3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide (mg / dm^3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density (g / cm^3): the density of wine is close to that of water, depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates (potassium sulphate - g / dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol (% by volume): the percent alcohol content of the wine
12 - quality (score between 0 and 10)
It’s important to note: several of the attributes may be correlated (e.g. density and alcohol level).
To start our analysis, let’s first load all necessary packages and our dataset. I’ll create a dataset called “wines” from two separate datasets, one for red wines and the other for white wines. I’ll also set quality as an ordered factor variable.
I already covered all the features of this dataset in the Introduction, but here we have more details about its structure: a list of its columns, the data type of each one and some sample values. We have a total of 6497 observations with 14 variables (of which only 11 are physicochemical).
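The loading code isn't shown in this report, so the following is only a minimal sketch of the step described above. The file names in the comments are hypothetical, and two tiny made-up data frames stand in for the real files so that the sketch runs on its own.

```r
# Stand-ins for the two source files. With the real data this would be, e.g.:
#   red   <- read.csv("wineQualityReds.csv")    # hypothetical file name
#   white <- read.csv("wineQualityWhites.csv")  # hypothetical file name
# The values below are illustrative only.
red   <- data.frame(alcohol = c(9.4, 9.8, 10.0), quality = c(5L, 5L, 7L))
white <- data.frame(alcohol = c(8.8, 9.5, 10.1), quality = c(6L, 6L, 8L))

# Tag each row with its wine type before stacking the two tables.
red$type   <- "red"
white$type <- "white"
wines <- rbind(red, white)

# Store quality as an ordered factor so that the rating order is preserved.
wines$quality <- factor(wines$quality, ordered = TRUE)

str(wines)
```

Keeping `type` as a column means a single table can later be split or faceted by wine type.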
## 'data.frame': 6497 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
## $ type : chr "red" "red" "red" "red" ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.: 813 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500
## Median :1650 Median : 7.000 Median :0.2900 Median :0.3100
## Mean :2044 Mean : 7.215 Mean :0.3397 Mean :0.3186
## 3rd Qu.:3274 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900
## Max. :4898 Max. :15.900 Max. :1.5800 Max. :1.6600
##
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 1.00
## 1st Qu.: 1.800 1st Qu.:0.03800 1st Qu.: 17.00
## Median : 3.000 Median :0.04700 Median : 29.00
## Mean : 5.443 Mean :0.05603 Mean : 30.53
## 3rd Qu.: 8.100 3rd Qu.:0.06500 3rd Qu.: 41.00
## Max. :65.800 Max. :0.61100 Max. :289.00
##
## total.sulfur.dioxide density pH sulphates
## Min. : 6.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.: 77.0 1st Qu.:0.9923 1st Qu.:3.110 1st Qu.:0.4300
## Median :118.0 Median :0.9949 Median :3.210 Median :0.5100
## Mean :115.7 Mean :0.9947 Mean :3.219 Mean :0.5313
## 3rd Qu.:156.0 3rd Qu.:0.9970 3rd Qu.:3.320 3rd Qu.:0.6000
## Max. :440.0 Max. :1.0390 Max. :4.010 Max. :2.0000
##
## alcohol quality type
## Min. : 8.00 3: 30 Length:6497
## 1st Qu.: 9.50 4: 216 Class :character
## Median :10.30 5:2138 Mode :character
## Mean :10.49 6:2836
## 3rd Qu.:11.30 7:1079
## Max. :14.90 8: 193
## 9: 5
The median fixed acidity is 7 g/L and the median volatile acidity is 0.29 g/L. Median citric acid concentration is 0.31 g/L and the median pH is 3.2, with all wines being acidic (pH between 2.72 and 4.01). Most wines in this sample aren’t considered sweet, with 75% of them having a residual sugar concentration of 8.1 g/L or less. Median chloride (salt) concentration for these wines is 0.05 g/L. Sulphates concentration is 0.53 g/L on average, and the medians of total and free sulfur dioxide are 118 ppm and 29 ppm, respectively. Density varies very little, ranging from 0.99 to 1.04 g/mL, with a mean of 0.99 g/mL. The mean alcohol content is 10.49% by volume, and most wines have a quality rating of 5 or 6, with few of them rated below 5.
We’ll begin by exploring the distribution and patterns of individual variables and how their distribution varies across types of wine. First, we’ll generate plots and then discuss the patterns and interesting finds.
For some variables, such as chlorides, density and total sulfur dioxide, I had to exclude outliers from the plots, since they created distortions that made it harder to spot overall traits.
As quality is our main feature of interest, we’ll start with it and then explore all other features in order of expected relevance for the differences in quality rating.
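The report doesn't say how the outliers were excluded. One common option, shown here purely as an assumption, is to plot only the values at or below a high percentile. The helper `trim_upper`, the 99% cutoff and the simulated values are all hypothetical.

```r
# Keep values at or below the chosen upper quantile (default: 99th percentile).
# Only the plotted subset changes; the underlying data stay untouched.
trim_upper <- function(x, prob = 0.99) {
  x[x <= quantile(x, probs = prob, na.rm = TRUE)]
}

# Simulated, right-skewed stand-in for a variable such as chlorides,
# with a handful of extreme values appended.
set.seed(1)
chlorides <- c(rlnorm(995, meanlog = log(0.05), sdlog = 0.3), rep(0.6, 5))

trimmed <- trim_upper(chlorides)
c(max_before = max(chlorides), max_after = max(trimmed))
```

Trimming only for display keeps the summary statistics honest, since they are still computed on the full data.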
## 3 4 5 6 7 8 9
## 10 53 681 638 199 18 0
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
The most common rating for red wines was 5, with a minimum rating of 3 and a maximum of 8. White wines had generally better ratings: their most common rating was 6, and 5 of them were rated 9, the highest rating in this group. 67% of white wines have ratings greater than 5, compared with only 53% of red wines.
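The two percentages follow directly from the rating counts printed above. A short check, with the counts copied from those tables (the helper name `share_above_5` is only illustrative):

```r
# Rating counts copied from the two tables above (ratings 3 to 9).
ratings      <- 3:9
red_counts   <- c(10, 53, 681, 638, 199, 18, 0)
white_counts <- c(20, 163, 1457, 2198, 880, 175, 5)

# Share of wines rated above 5, for one vector of counts.
share_above_5 <- function(counts) sum(counts[ratings > 5]) / sum(counts)

round(100 * c(red = share_above_5(red_counts),
              white = share_above_5(white_counts)), 1)
```

That is 855 of 1599 reds and 3258 of 4898 whites.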
Quality rating distribution appears to be normal for both red and white wines, though the distribution for red wines seems to be wider. Here we can clearly see that most wines in the sample are white.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Alcohol concentration distribution is very similar for red and white wines. Both range from about 8% to 15% by volume and appear to be bimodal, with a peak below 10% (about 9.5%) and a smaller one above 10% (about 11% for reds and 12.5% for whites). The distribution seems wider for white than for red wines, and the white mean is slightly higher than the red one, at 10.51% and 10.42% respectively. Reds have 3 outliers above 13.5% while whites have none.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
White wines have more sugar than red wines on average. The mean residual sugar concentration is 6.4 g/L for whites and 2.5 g/L for reds, about 2.5 times lower. The maximum concentration for whites, 65.8 g/L, is also much higher than for reds, 15.5 g/L. One interesting fact is that the median and 3rd quartile of red wines are very close to each other, at 2.2 g/L and 2.6 g/L respectively.
Residual sugar distribution differs greatly between red and white wines. While both appear positively skewed, the red distribution is concentrated around 2 g/L and the white one is much wider, with a greater diversity of sugar levels. The boxplots in particular show how concentrated the distribution of red wines is compared to white wines, which have an interquartile range almost 12x wider.
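Both ratios quoted above can be recomputed from the two summaries. A small sketch using the quartiles and means printed there:

```r
# Residual sugar (g/L): quartiles and means copied from the summaries above.
red   <- c(q1 = 1.9, q3 = 2.6, mean = 2.539)
white <- c(q1 = 1.7, q3 = 9.9, mean = 6.391)

# Interquartile range = 3rd quartile minus 1st quartile.
iqr_red   <- red[["q3"]] - red[["q1"]]
iqr_white <- white[["q3"]] - white[["q1"]]

c(iqr_red = iqr_red,
  iqr_white = iqr_white,
  iqr_ratio = iqr_white / iqr_red,
  mean_ratio = white[["mean"]] / red[["mean"]])
```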
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Average chloride concentration is higher in red wines, at almost double the concentration in white wines: 0.087 g/L against 0.046 g/L.
The concentration for both wine types appears normally distributed. In the histogram and density plot, the red distribution has a higher mean, as it’s shifted to the right of the white distribution. In the boxplots, the distribution of red wines is slightly wider than that of white wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Density varies very little among wines, with most of them having a density between 0.99 g/mL and 1.00 g/mL.
Density distribution seems normal and more concentrated for red wines, while positively skewed and wider for white wines. The distribution for reds has only one noticeable peak, while it seems trimodal for whites, possibly with peaks that match the peaks in alcohol level.
Fixed acidity, which is given by the concentration of tartaric acid, is roughly normally distributed in both wine types and peaks at about 7 g/L. Its distribution for red wines is a bit higher, wider and more skewed than for white wines.
Volatile acidity, which is given by acetic acid concentration, is generally lower (mean at about 0.25 g/L) and roughly bell-shaped with a positive skew for white wines. The distribution for red wines seems wider and bimodal.
Citric acid concentration varies around 0.3 g/L and presents a trimodal distribution for both wine types. While almost all white wines contain some level of citric acid, a significant portion of red wines have a concentration of 0 g/L.
pH for both wine types is normally distributed around a mean of about 3.25. The distribution for red wines is slightly higher than for white wines.
Sulphates concentration is very similar for red and white wines, both normally distributed and with peaks between 0.4 and 0.6 g/L. The only noticeable difference is that concentration is generally lower for white wines.
Total and free sulfur dioxide concentration distribution is more positively skewed for red wines, while white wines have a wider normal-like distribution. Free sulfur dioxide presents lower values than total sulfur dioxide, which makes sense since the free form is one component of the total. One important difference between types is that total and free sulfur dioxide concentration is generally higher for white wines. Most of them have a concentration that makes the gas evident in the nose and taste of wine (above the threshold of 50 ppm), while most red wines don’t.
One interesting fact with these variables is that they present many outliers, with values as high as about 3x the mean.
From the analysis of individual variables distributions, the most relevant facts I found were:
Now we’ll explore relationships between the features in the dataset. I’ll look at relationships between pairs of variables, starting with the relationship of quality with other variables.
Based on the paired plots above, I decided to look at bivariate plots of the variables that seemed most correlated with quality, to get a better definition of their relationships.
As quality is an ordered factor, I decided to create boxplots of the variables as functions of quality rating, and to add a line connecting the median values for each quality rating, so we can see trends.
Here we can see the relationship between alcohol level and quality. Though not through a straight line, we can notice that quality is generally positively correlated to alcohol level for both red and white wines. Alcohol seems to be the variable with the greatest correlation to quality.
Here, we can notice that quality apparently has a strong negative correlation with volatile acidity in red wines. Quality rating doesn’t seem to be affected by volatile acidity in white wines, though.
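The plotting code isn't shown, so here is a minimal sketch of the idea in base R graphics (the original plots may well have used another package). The data are simulated stand-ins, not the wine data.

```r
# Simulated stand-in data: alcohol loosely increasing with quality rating.
set.seed(42)
quality <- factor(sample(3:8, 300, replace = TRUE), ordered = TRUE)
alcohol <- 9 + 0.4 * as.integer(quality) + rnorm(300, sd = 0.8)

# Median alcohol level within each quality rating.
medians <- tapply(alcohol, quality, median)

pdf(NULL)  # null device so the sketch also runs non-interactively
boxplot(alcohol ~ quality, xlab = "quality", ylab = "alcohol (% by volume)")
# Boxes are drawn at x = 1, 2, ..., so the medians can be joined directly.
lines(seq_along(medians), medians, col = "red", lwd = 2)
invisible(dev.off())

round(medians, 2)
```

In an interactive session the `pdf(NULL)` and `dev.off()` lines can be dropped so the plot appears on screen.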
Citric acid seems to be positively correlated to quality but only in red wines as well. Citric acid concentration is stable across quality groups in white wines.
In the boxplots above, we can see a small but noticeable negative correlation between quality and salt in red wines, and a stronger negative correlation in white wines.
In the first boxplot row above, we can see the relationship between density and quality. We can spot a negative correlation between density and quality in both wine types, but a stronger one for white wines.
The second row shows us how quality varies with pH. Here, an interesting trend emerges: though not strongly, quality seems to correlate with pH in both wine types. But while for red wines the correlation seems negative, for white wines it seems positive.
And lastly, we have the relationship between sulphates and quality. Although potassium sulphate concentration doesn’t vary a lot, we can notice a small increase in its concentration for higher quality scores in red wines.
## volatile.acidity citric.acid total.sulfur.dioxide
## volatile.acidity 1.0000000 -0.55249568 0.07647000
## citric.acid -0.5524957 1.00000000 0.03553302
## total.sulfur.dioxide 0.0764700 0.03553302 1.00000000
## sulphates -0.2609867 0.31277004 0.04294684
## alcohol -0.2022880 0.10990325 -0.20565394
## quality -0.3905578 0.22637251 -0.18510029
## sulphates alcohol quality
## volatile.acidity -0.26098669 -0.20228803 -0.3905578
## citric.acid 0.31277004 0.10990325 0.2263725
## total.sulfur.dioxide 0.04294684 -0.20565394 -0.1851003
## sulphates 1.00000000 0.09359475 0.2513971
## alcohol 0.09359475 1.00000000 0.4761663
## quality 0.25139708 0.47616632 1.0000000
Here we have the correlation matrix for the 5 features with the highest correlation to quality in red wines. Important facts to notice are:
## volatile.acidity chlorides total.sulfur.dioxide
## volatile.acidity 1.00000000 0.07051157 0.0892605
## chlorides 0.07051157 1.00000000 0.1989103
## total.sulfur.dioxide 0.08926050 0.19891030 1.0000000
## density 0.02711385 0.25721132 0.5298813
## alcohol 0.06771794 -0.36018871 -0.4488921
## quality -0.19472297 -0.20993441 -0.1747372
## density alcohol quality
## volatile.acidity 0.02711385 0.06771794 -0.1947230
## chlorides 0.25721132 -0.36018871 -0.2099344
## total.sulfur.dioxide 0.52988132 -0.44889210 -0.1747372
## density 1.00000000 -0.78013762 -0.3071233
## alcohol -0.78013762 1.00000000 0.4355747
## quality -0.30712331 0.43557472 1.0000000
This is the same correlation matrix as before, but now for white wines. Important facts to notice are:
As we’ve seen in the paired plots and in these matrices, many variables in the dataset correlate to each other. Some correlations are obvious, such as fixed acidity and pH, but others are less obvious, such as the one between residual sugar and sulfur dioxide.
In this section we explore some of these interesting indirect correlations.
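Correlation matrices like the ones above can be produced with `cor()`, but since quality is stored as an ordered factor it has to be converted back to numbers first. The sketch below uses simulated stand-in data and assumes Pearson correlation, which the report does not state explicitly.

```r
# Simulated stand-in for the red wine table (values are not real data).
set.seed(7)
n <- 200
alcohol          <- rnorm(n, mean = 10.4, sd = 1)
volatile.acidity <- rnorm(n, mean = 0.5, sd = 0.15)
score <- round(5.6 + 0.4 * (alcohol - 10.4) -
                 2.5 * (volatile.acidity - 0.5) + rnorm(n, sd = 0.6))
red <- data.frame(alcohol, volatile.acidity,
                  quality = factor(score, ordered = TRUE))

# cor() needs numeric input. Going through as.character() recovers the
# ratings themselves rather than the internal level codes (1, 2, 3, ...).
red$quality <- as.numeric(as.character(red$quality))

m <- cor(red[, c("alcohol", "volatile.acidity", "quality")])
round(m, 2)
```

Because the ratings are ordinal, passing `method = "spearman"` to `cor()` is a reasonable alternative to the default.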
Through the scatter plots above, we can see that pH has a strong negative correlation with fixed acidity, and with citric acid concentration as well, though not as strongly. That makes sense, since the more acid there is in the wine, the lower its pH tends to be. But one initially surprising and counterintuitive correlation is the positive one between volatile acidity and pH. One would naturally expect volatile acidity to decrease pH, since we tend to expect that volatile acidity increases with total acidity. But the relationship is the opposite, maybe because as volatile acidity increases, more acid tends to be released from the liquid and therefore the liquid becomes less acidic.
As the scatter plots above show, residual sugar level is positively correlated to total and free sulfur dioxide. As sulfur dioxide prevents microbial growth and the oxidation of wine, it makes sense for it to be in greater concentration in the wines most vulnerable to microbial growth, which I assume to be the wines with more sugar.
Residual sugar also strongly correlates with density. As sugar level increases, so does density. It makes sense: the more sugar is dissolved in the liquid, the heavier it tends to get without a matching increase in volume.
As the first scatter plot in this sequence shows, for white wines, residual sugar is negatively correlated to alcohol level. It makes sense if we consider that alcohol production through fermentation consumes sugar, so we’d have less sugar left where more alcohol was produced. In red wines, we see a very small increase in sugar level with higher alcohol levels. That seems to counter the facts mentioned before. But as the residual sugar range in red wines is very narrow, I assume that red grapes generally have less sugar than white grapes, which forces producers to consume the maximum amount of sugar to produce wines and to choose sweeter grapes to make wines with more alcohol.
Alcohol also seems to be negatively related to salt level, though I can’t find a logical explanation for it.
Total sulfur dioxide appears to be negatively correlated to alcohol in both white and red wines. This relationship might be indirect, due to sugar. In white wines at least, more alcohol tends to go with less residual sugar, which in turn tends to go with less sulfur dioxide.
Finally, we can see the strongest correlation found in the dataset: the negative one between alcohol and density. As alcohol concentration increases, density decreases. That’s because alcohol is less dense than water, so as its share of the volume increases, the mean density of the whole solution tends to decrease.
From the analysis of bivariate relationships in this dataset, what I consider the most important findings are:
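The direction of this effect can be illustrated with a rough back-of-the-envelope estimate. The sketch assumes ideal mixing of pure water and ethanol (additive volumes, nothing else dissolved), which real wine does not satisfy, and uses approximate densities at 20 °C. The function name `mix_density` is only illustrative.

```r
# Approximate densities at 20 degrees C, in g/mL.
rho_water   <- 0.998
rho_ethanol <- 0.789

# Ideal-mixing estimate: volume-weighted mean of the two densities.
# Assumption: volumes are additive and no sugar or acids are dissolved.
mix_density <- function(abv) {
  frac <- abv / 100
  frac * rho_ethanol + (1 - frac) * rho_water
}

abv <- c(8, 10, 12, 14)
round(mix_density(abv), 4)
```

The estimate falls steadily as alcohol rises, matching the sign of the observed correlation. The densities in the dataset are higher than these values, which is consistent with sugar and other dissolved substances pushing density up.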
We’ve already uncovered the factors that were most correlated to quality in our dataset. But now, in order to see if there is some interesting relationship that could be uncovered by observing how quality varies along two other variables, we’ll plot the main features related to quality together and use color to distinguish their quality rating. I added ellipses to the scatterplots that represent the 95% confidence interval of where datapoints were likely to fall for each quality rating.
Here we can see how both red and white wine quality varies with density and alcohol. As density decreases and alcohol level increases, quality tends to increase as well, as shown by the higher density of dark datapoints and the ellipses getting darker towards the bottom right of the plots.
We can also notice that red wines seem to be much less affected by density than white wines.
We can notice here that for red wines, quality tends to increase towards the bottom right side of the chart. That indicates that quality tends to increase with alcohol level and decrease with volatile acidity in red wines. On the other hand, white wines seem to be almost unaffected by changes in the level of volatile acidity, varying just slightly negatively in relation to it.
Here we can observe that each wine type varies more noticeably with only one of the two variables. Red wine quality increases slightly as citric acid concentration increases, as can be noticed by the increased density of darker datapoints upwards. And it looks like white wine quality increases as chloride concentration decreases, while remaining relatively unaffected by citric acid changes.
Finally, in this plot we see that red wine quality increases with sulphates concentration and tends to be slightly decreased by total sulfur dioxide concentration. On the other hand, there’s no noticeable trend between white wine quality and either sulphates or total sulfur dioxide concentrations.
To sum up our analysis, I decided to build simple linear models to predict quality rating of red and white wines from the features with the highest correlation to it.
For red wines I used alcohol level and volatile acidity as independent variables because they’re the 2 features with the highest correlation to quality, and the correlation between them is relatively low. Moreover, R2 didn’t increase by adding other features to the model.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = red)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = red)
##
## ==========================================
## m1 m2
## ------------------------------------------
## (Intercept) 1.875*** 3.095***
## (0.175) (0.184)
## alcohol 0.361*** 0.314***
## (0.017) (0.016)
## volatile.acidity -1.384***
## (0.095)
## ------------------------------------------
## R-squared 0.2 0.3
## adj. R-squared 0.2 0.3
## sigma 0.7 0.7
## F 468.3 370.4
## p 0.0 0.0
## Log-likelihood -1721.1 -1621.8
## Deviance 805.9 711.8
## AIC 3448.1 3251.6
## BIC 3464.2 3273.1
## N 1599 1599
## ==========================================
The variables in this linear model can account for 30% of the variance in the quality rating of red wines.
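For reference, a comparison of two nested models like `m1` and `m2` can be sketched as follows. The data are simulated stand-ins whose coefficients only loosely echo the table above, and quality is treated as a plain number here, which is an assumption about how the models were fitted.

```r
# Simulated stand-in for the red wine data (values are not real data).
set.seed(123)
n <- 500
red <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1.1),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18)
)
red$quality <- 3.1 + 0.31 * red$alcohol - 1.4 * red$volatile.acidity +
  rnorm(n, sd = 0.7)

# Nested models: alcohol only, then alcohol plus volatile acidity.
m1 <- lm(quality ~ alcohol, data = red)
m2 <- lm(quality ~ alcohol + volatile.acidity, data = red)

# Share of variance explained by each model.
c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)

# F-test for whether the extra predictor improves the fit.
anova(m1, m2)
```

For nested least-squares models, plain R-squared cannot go down when a predictor is added, so adjusted R-squared or the F-test is the more informative way to judge an extra feature.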
Considering that quality is an integer, I decided to run the model rounding its predicted quality rating, so I could then compare it to the actual ratings. With this change, the model correctly predicted the quality of 906 instances from 1599 wines in the sample, yielding an accuracy rate of 57%.
## Mode FALSE TRUE NA's
## logical 693 906 0
## [1] 0.5666041
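The accuracy figure is simply the share of wines whose rounded prediction equals the actual rating. A minimal sketch of that calculation, using a hypothetical helper `rounded_accuracy` and made-up toy values:

```r
# Share of cases whose rounded prediction equals the actual rating.
rounded_accuracy <- function(predicted, actual) {
  mean(round(predicted) == actual)
}

# Toy example with made-up predictions and ratings: 3 of 5 match.
predicted <- c(5.2, 5.6, 6.4, 4.9, 6.6)
actual    <- c(5,   5,   6,   6,   7)
rounded_accuracy(predicted, actual)

# The reported figure follows directly from the counts in the output above.
906 / (693 + 906)
```

With a fitted model, `predicted` would come from something like `predict(m2)`, with the ratings converted to numbers for the comparison.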
For white wines, I only used alcohol level to predict quality. When other features were added to the model, they made it worse, decreasing its R2.
##
## Calls:
## m1: lm(formula = round(quality, digits = 0) ~ alcohol, data = white)
##
## =============================
## (Intercept) 2.582***
## (0.098)
## alcohol 0.313***
## (0.009)
## -----------------------------
## R-squared 0.2
## adj. R-squared 0.2
## sigma 0.8
## F 1146.4
## p 0.0
## Log-likelihood -5839.4
## Deviance 3112.3
## AIC 11684.8
## BIC 11704.3
## N 4898
## =============================
And as we can see, alcohol in this linear model accounts for 20% of the variance in the quality rating of white wines.
To test the predictive power of the model for white wines, I tweaked it the same way, rounding its output. In this way, the model correctly predicted the quality of 2398 instances from 4898 white wines in the sample, yielding an accuracy rate of 49%.
## Mode FALSE TRUE NA's
## logical 2500 2398 0
## [1] 0.4895876
The main goal of this exploratory analysis was to understand how the quality of red and white wines is impacted by physicochemical properties such as residual sugar, chlorides, pH, alcohol level, and more. A surprising finding is that most properties in the dataset don’t correlate strongly with quality, except for alcohol level. Alcohol presented a moderate correlation to quality for the alcohol range in the sample, from about 9% to 14% by volume. As can be seen in this plot, generally speaking, as alcohol volume increases from about 9% to 14%, median quality ratings (red lines) also increase for both red and white wines, though this trend is less clear for quality ratings from 3 to 5. And as the boxplots for each quality score show, the alcohol values in the interquartile range also tend to increase with quality score, mainly for wines with the best scores (7-9).
The main question derived from this fact is: why? From what I’ve read about alcohol and quality, experts don’t seem to have an intentional preference for alcohol levels from 11% to 14%. And as their judgement tends to be based on other properties such as wine texture, color, region, and so on, I assume that white and red wines that possess valued characteristics in those other properties may also have an alcohol level in the 11% to 14% range. So maybe alcohol has a strong correlation to other properties external to this dataset that are more commonly associated with quality, and that’s why alcohol level tends to increase with quality rating (for the alcohol range in the sample).
An important and interesting finding from the exploration is that the strongest relationships in the dataset were found among the features themselves instead of with quality. A striking example is the correlation between alcohol level and density, which has a coefficient of -0.78 for white wines. As can be noticed in this scatterplot, wine density tends to decrease as alcohol level increases. This happens because alcohol’s density is lower than water’s and therefore, as alcohol level increases, overall wine density tends to decrease. These “indirect” relationships were found among other features in the dataset, such as citric acid and volatile acidity or fixed acidity and pH.
I see this as an indication of two possibilities:
the existence of confounding factors. Some of the physicochemical properties measured to form this dataset might be the result or input of the same chemical reaction, for example. As they’re connected by the same process, their variation is likely connected;
bad feature selection. Is it really useful to have variables that are clearly related, such as fixed acidity and pH, in the same dataset, or are they redundant? If we’re to use them to predict quality score or understand what may cause a better quality score, having many variables that correlate to each other isn’t helpful.
Another important trend found in the dataset is that the impact each feature has in quality differs significantly for red and white wines.
As can be seen in this plot, the darker ellipses (which represent the 95% confidence interval of where wines with higher quality rating might appear) tend to move horizontally to the right for both wine types. This indicates that the quality of both wine types increases with alcohol level.
On the other hand, the darker ellipses tend to move vertically only for red wines, which indicates that only red wine is impacted by acetic acid concentration. As acetic acid concentration increases, quality distribution remains virtually uniform for white wines, whereas quality for red wines increases as acetic acid concentration decreases.
This is a common phenomenon in this dataset. Other properties, such as citric acid or chloride concentration, also impact red and white wines in different ways.
In the beginning of the analysis it was interesting to discover how the distribution of physicochemical properties differed between red and white wines, especially the differences in residual sugar, sodium chloride, density and total sulfur dioxide. One finding that was particularly intriguing is the tendency of white wines to be better rated: 67% of white wines against 53% of reds had ratings greater than 5. Does that mean experts prefer white wines? I don’t think we can quite say that, but this trend raises that question and leaves us wondering whether the judges were the same for red and white wines and whether they were biased towards white wines. One should expect such differences in distributions of red and white wine features, but it’s interesting to see them so tangibly.
While it was possible to create a linear model with an accuracy rate of almost 60% for red wines and, more disappointingly, almost 50% for white wines, I felt frustrated to realize that most of the extreme ratings (i.e. 3, 4 and 8, 9) were not predicted. And I ended up feeling that these models might be just as good as guessing. Moreover, another frustrating fact is how few features correlated to quality. Only alcohol had a moderate correlation to quality in both wines, and the rest of the variables had weak or very weak correlations to quality. This was especially true for white wines, whose correlations were generally weaker than the ones found in red wines.
One fact that made me question the presence of some features in the dataset was the strong correlations some features had among themselves, such as alcohol and density or pH and fixed acidity. This indicates that some third missing variable might be causing the variation of these features and that some dependent variables (e.g. density) might be redundant.
Overall, I felt the feature set we had wasn’t that relevant for predicting quality rating. Experts’ judgement might indeed depend on taste, but it seems that the physicochemical variables in this dataset, except for alcohol, weren’t strong enough to affect experts’ perception of taste. In addition, as we can assume from experience that the experts’ judgement is subjective, it may depend on other variables such as types of grape, production year, producer, country of origin, region, aging and price. I suspect that these variables are more strongly correlated to ratings than a wine’s physicochemical properties. Nevertheless, it’s interesting to realize that you can still judge wine quality (to some extent) using physicochemical data without the need to actually taste it.
An interesting follow-up project to this one would be to include the other variables mentioned above so we could create a predictive model of quality in red and white wines. This model could provide insights and guidance to data-driven wine producers, so that they could manage their production to create wines with the right features to yield a high quality perception. That model would be a valuable asset and source of competitive advantage to wine producers who apply it.
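One way to put the "as good as guessing" impression into numbers is to compare each model with a naive baseline that always predicts the most common rating. This baseline is not part of the original analysis. The sketch uses the rating counts and model accuracies reported earlier.

```r
# Rating counts from the tables earlier in the analysis (ratings 3 to 9).
red_counts   <- c(10, 53, 681, 638, 199, 18, 0)
white_counts <- c(20, 163, 1457, 2198, 880, 175, 5)

# Accuracy of always predicting the most common rating.
baseline <- function(counts) max(counts) / sum(counts)

# Model accuracies as reported earlier.
model <- c(red = 0.5666041, white = 0.4895876)
base  <- c(red = baseline(red_counts), white = baseline(white_counts))

round(rbind(model, baseline = base, gain = model - base), 3)
```

Against this baseline, the red wine model gains about 14 percentage points and the white wine model about 4, so the impression is closer to the truth for white wines than for red.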